Support WDL call caching #5105
Conversation
…RIs as URIs for import
I need to make this also support caching calls to whole workflows. If I use the files in the output tasks' caches instead of fetching the cached results of one workflow, I get different paths (that symlink to the same files), and then I can't re-use MiniWDL cached calls that depend on files from the called workflow.
Also fix linkImports from workers, and remove the weird context manager hardlink setup that didn't appear to work anyway.
I think we might run into problems with sibling files with WDL call caching when reading caches written by MiniWDL. MiniWDL will lay out task outputs by output name, and then reference those locations in the cache files. So files that were siblings won't be anymore when loaded from MiniWDL's task output directories. I think we might just have to let that not work. When Toil writes files into the cache to reference them, it should obey the sibling-files constraint, because it uses the same code to pick paths as when writing them for a task.
https://ucsc-ci.com/databiosphere/toil/-/jobs/79122 got a …
This looks alright to me. Is there a way a test could be added?
@@ -267,7 +268,7 @@ def _runStep(self):
        if self.checkOnJobs():
            activity = True
        if not activity:
-           logger.debug('No activity, sleeping for %is', self.boss.sleepSeconds())
+           logger.log(TRACE, 'No activity, sleeping for %is', self.boss.sleepSeconds())
Should this go back to debug?
No, I lowered this to TRACE since I didn't think it made sense to log every second at debug level.
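For illustration only, here is a minimal, generic sketch (not Toil's actual implementation) of how a TRACE level below DEBUG can be registered with Python's standard logging module, so that per-second polling messages stay hidden unless the logger is explicitly set that low:

```python
import logging

# Assumption: any numeric value below logging.DEBUG (10) works as a TRACE level.
TRACE = logging.DEBUG - 5
logging.addLevelName(TRACE, "TRACE")

logging.basicConfig(level=logging.DEBUG)  # TRACE messages are not shown at DEBUG
logger = logging.getLogger(__name__)

logger.log(TRACE, 'No activity, sleeping for %is', 2)  # suppressed at DEBUG
logger.debug('Still visible at DEBUG')                 # shown
```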
src/toil/utils/toilStatus.py
Outdated
@@ -49,14 +49,14 @@ def print_dot_chart(self) -> None:

        # Make job IDs to node names map
        jobsToNodeNames: Dict[str, str] = dict(
-           map(lambda job: (str(job.jobStoreID), job.jobName), self.jobsToReport)
+           map(lambda job: (str(job.jobStoreID), str(job.jobStoreID).replace("_", "___").replace("/", "_").replace("-", "__")), self.jobsToReport)
Hmmm... `name_/something` seems equivalent to `name/_something` after conversion. Is that alright?
Oh good point, I didn't think about adjacent sequences.
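To make the collision concrete, here is a small sketch (the helper name is made up) applying the same replace() chain to two different job store IDs:

```python
def to_node_name(job_store_id: str) -> str:
    # Same substitution chain as in the diff above.
    return job_store_id.replace("_", "___").replace("/", "_").replace("-", "__")

print(to_node_name("name_/something"))  # -> name____something
print(to_node_name("name/_something"))  # -> name____something (same result, so the mapping is not injective)
```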
src/toil/wdl/wdltoil.py
Outdated
    imported = file_dest.import_file(candidate_uri, check_existence=False)
else:
    # The file store import_file doesn't do an existence check.
    # TODO: Have a more compatible interface.
In what way?
@stxue1 Can you maybe look at this? I tried to merge my code that tacks a shared filesystem path onto every File that has one with your code that defers virtualization to task boundaries, but I know I'm missing at least one route that files can come in, because I don't think I am managing to add the shared filesystem path when a workflow-level string is coerced to a file and then virtualized. Where is that happening, and at that point do we have a good way to distinguish a leader-filesystem file from a worker-filesystem file? Also, I went through and turned all the …
The virtualization of a workflow-level string being coerced to a File takes place in two areas. On task boundaries, virtualization is handled when … I think the two places that matter are in the …
The mutable replacements should be fine; the metadata they copy over includes the necessary information for virtualization.
I suppose there isn't currently a very good way to distinguish whether file.value is pointing to a worker-filesystem file or a leader-filesystem file. Any use of a file will automatically try to read from the virtualized value, creating it if it doesn't exist (and thus uploading it to the jobstore), then changing the file.value field to the corresponding devirtualized path. This behavior is left over from before, when we read virtualized values that were set in the file.value field. Perhaps we should change it so the file.value field always points to a path that exists on the associated worker/leader. My main concern is figuring out where exactly to do this without breaking path substitution support. In theory, I think doing it at the beginning of every job should be fine.
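Not a proposal for the actual wdltoil code, just a sketch of the idea in that last sentence: at the start of each job, make sure every File's .value is a path that exists on the current machine, and devirtualize it otherwise. `devirtualize_from_jobstore` is a hypothetical helper, not a real function:

```python
import os
from typing import Callable

import WDL  # MiniWDL


def ensure_local(file: WDL.Value.File, devirtualize_from_jobstore: Callable[[str], str]) -> None:
    # If the recorded path doesn't exist here, it was written on some other
    # machine (e.g. the leader); replace it with a locally materialized copy.
    if not os.path.exists(file.value):
        file.value = devirtualize_from_jobstore(file.value)
```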
)

# Apply the shared filesystem path to the virtualized file
set_shared_fs_path(virtualized_file, exported_path)
Missing an assignment?
Steven's File Facts

- The original string value is usually stored as the …
- In the task wrapper the original string path is used.
- JSON-based inputs get stored as …
- Strings coerced inside task output sections appear as …
… importing, and remove banned mutation
This adds support for WDL call caching compatible with MiniWDL, controlled by the same configuration/environment variables that MiniWDL uses. This should fix #4797.
It only caches task calls, and the `write_*` WDL function results necessary to have calls that depend on workflow-generated files. It copies all output files from a Toil step into the MiniWDL cache folder when saving to the cache, since Toil doesn't usually generate a persistent copy of task output files.
It does not cache downloads like MiniWDL can, so tasks that depend on URL files probably can't be cached (or might get stuck cached and not update when the URL content changes).
It doesn't do anything special for string to File coercion, but probably should.
If you have this workflow:

…

Then you can run it with either of:

…

or:

…

and the second one will get cache hits for all the tasks that the first one ran.
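Conceptually, a call cache keys stored outputs on the task and its inputs, so an identical second run can reuse them instead of re-running the task. The following is a rough, illustrative Python sketch of that idea only; it is not MiniWDL's actual key format or on-disk layout, and all names in it are made up:

```python
import hashlib
import json
import os
from typing import Optional


def cache_key(task_source: str, inputs: dict) -> str:
    # Digest the task definition and its (JSON-serializable) inputs so that
    # identical calls produce identical keys.
    digest = hashlib.sha256()
    digest.update(task_source.encode())
    digest.update(json.dumps(inputs, sort_keys=True).encode())
    return digest.hexdigest()


def lookup(cache_dir: str, key: str) -> Optional[dict]:
    # Return previously cached outputs for this key, if any.
    path = os.path.join(cache_dir, key + ".json")
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return None


def store(cache_dir: str, key: str, outputs: dict) -> None:
    # Record outputs so a later run with the same key can skip re-running the task.
    os.makedirs(cache_dir, exist_ok=True)
    with open(os.path.join(cache_dir, key + ".json"), "w") as f:
        json.dump(outputs, f)
```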
Changelog Entry
To be copied to the draft changelog by merger:
Reviewer Checklist
- … issues/XXXX-fix-the-thing in the Toil repo, or from an external repo.
- … camelCase … that want to be in snake_case.
- … docs/running/{cliOptions,cwl,wdl}.rst
Merger Checklist